    A differentiable BLEU loss. Analysis and first results

    In natural language generation tasks such as neural machine translation and image captioning, there is usually a mismatch between the optimized loss and the de facto evaluation criterion: token-level maximum likelihood versus corpus-level BLEU score. This article tries to reduce this gap by defining differentiable computations of the BLEU and GLEU scores. We test this approach on simple tasks, obtaining valuable lessons on its potential applications but also on its pitfalls, chiefly that these loss functions push each token in the hypothesis sequence toward the average of the tokens in the reference, resulting in a poor training signal. Peer reviewed. Postprint (published version).
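
    As a rough illustration of what a differentiable BLEU-style objective can look like, the sketch below computes an expected, clipped unigram-precision term from the model's softmax outputs. It is a minimal approximation in the spirit of the article, covering only the n=1 term of BLEU; the formulation and names are our assumptions, not the authors' code.

        import torch

        def soft_unigram_precision(probs: torch.Tensor, ref_ids: torch.Tensor,
                                   vocab_size: int) -> torch.Tensor:
            """probs: (T_hyp, V) softmax outputs; ref_ids: (T_ref,) reference tokens."""
            # Clipped reference counts, as in BLEU's modified n-gram precision.
            ref_counts = torch.bincount(ref_ids, minlength=vocab_size).float()
            # Expected hypothesis counts per vocabulary entry (differentiable in probs).
            hyp_counts = probs.sum(dim=0)
            # Expected clipped matches: a hypothesis cannot match a reference
            # token more often than the reference contains it.
            matches = torch.minimum(hyp_counts, ref_counts).sum()
            return matches / probs.shape[0]

    Note that the pitfall named above is already visible here: the term depends only on the expected bag of tokens, not on their order, so every hypothesis position receives the same pull toward the reference's aggregate token statistics.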

    Chinese-Catalan: A neural machine translation approach based on pivoting and attention mechanisms

    This article addresses machine translation from Chinese to Catalan using neural pivot strategies trained without any direct parallel data. Catalan is linguistically very similar to Spanish, which motivates the use of Spanish as the pivot language. Regarding the neural architecture, we use the current state of the art, the Transformer model, which is based solely on attention mechanisms. Additionally, this work provides new resources to the community: a human-developed gold standard of 4,000 sentences between Catalan and Chinese and all the other official United Nations languages (Arabic, English, French, Russian, and Spanish). Results show that the standard pseudo-corpus (synthetic pivot) approach performs better than cascading. Peer reviewed. Postprint (author's final draft).
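
    The pseudo-corpus strategy can be summarized as: machine-translate the Spanish side of an existing Chinese-Spanish parallel corpus into Catalan, then train a direct Chinese-Catalan model on the synthetic pairs. A schematic sketch, where translate_es_ca is a hypothetical stand-in for any trained Spanish-to-Catalan system:

        def build_pseudo_corpus(zh_es_pairs, translate_es_ca):
            """Turn a Chinese-Spanish parallel corpus into synthetic Chinese-Catalan data."""
            zh_ca_pairs = []
            for zh_sentence, es_sentence in zh_es_pairs:
                # Replace the Spanish side with its machine translation into Catalan.
                ca_sentence = translate_es_ca(es_sentence)
                zh_ca_pairs.append((zh_sentence, ca_sentence))
            return zh_ca_pairs  # training data for a direct Chinese->Catalan model

    The cascade alternative instead chains two systems at inference time (Chinese to Spanish, then Spanish to Catalan); the abstract reports that training on the synthetic corpus works better.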

    Combining subword representations into word-level representations in the transformer architecture

    In neural machine translation, using word-level tokens leads to degradation in translation quality. The dominant approaches use subword-level tokens, but this increases the length of the sequences and makes it difficult to profit from word-level information such as POS tags or semantic dependencies. We propose a modification to the Transformer model to combine subword-level representations into word-level ones in the first layers of the encoder, reducing the effective length of the sequences in the following layers and providing a natural point to incorporate extra word-level information. Our experiments show that this approach maintains the translation quality of the standard Transformer model when no extra word-level information is injected, and that it is superior to the currently dominant method for incorporating word-level source-language information into models based on subword-level vocabularies. This work is partially supported by Lucy Software / United Language Group (ULG) and the Catalan Agency for Management of University and Research Grants (AGAUR) through an Industrial PhD Grant. This work is also supported in part by the Spanish Ministerio de Economía y Competitividad, the European Regional Development Fund through the postdoctoral senior grant Ramón y Cajal, and by the Agencia Estatal de Investigación through the project EUR2019-103819. Peer reviewed. Postprint (published version).
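
    One simple way to realize the combination step described above is to pool each word's subword vectors into a single vector after the first encoder layers. The sketch below uses mean pooling and assumes a precomputed word_ids map from subword positions to words; the paper's actual combination mechanism may differ.

        import torch

        def pool_subwords(subword_states: torch.Tensor, word_ids: torch.Tensor,
                          n_words: int) -> torch.Tensor:
            """subword_states: (T_sub, D); word_ids: (T_sub,) word index per subword."""
            d = subword_states.shape[1]
            # Sum the subword vectors belonging to each word...
            sums = torch.zeros(n_words, d).index_add_(0, word_ids, subword_states)
            # ...and divide by the number of subwords per word (mean pooling).
            counts = torch.bincount(word_ids, minlength=n_words).clamp(min=1)
            # One vector per word; word-level features such as POS embeddings can
            # be added to this shorter sequence before the remaining encoder layers.
            return sums / counts.unsqueeze(1).float()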

    Evaluating the underlying gender bias in contextualized word embeddings

    Gender bias strongly impacts natural language processing applications. Word embeddings have been shown both to retain and to amplify gender biases present in current data sources. Recently, contextualized word embeddings have enhanced previous word embedding techniques by computing word vector representations that depend on the sentence they appear in. In this paper, we study the impact of this conceptual change in the word embedding computation in relation to gender bias. Our analysis includes different measures previously applied in the literature to standard word embeddings. Our findings suggest that contextualized word embeddings are less biased than standard ones, even when the latter are debiased. We want to thank Hila Gonen for her support during our research. This work is supported in part by the Catalan Agency for Management of University and Research Grants (AGAUR) through the FI PhD Scholarship and the Industrial PhD Grant. This work is also supported in part by the Spanish Ministerio de Economía y Competitividad, the European Regional Development Fund and the Agencia Estatal de Investigación, through the postdoctoral senior grant Ramón y Cajal, contract TEC2015-69266-P (MINECO/FEDER, EU) and contract PCIN-2017-079 (AEI/MINECO). Peer reviewed. Postprint (published version).
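
    One family of measures applied in this line of work projects word vectors onto a gender direction. The sketch below shows a cosine-based variant of that idea; embed is a placeholder for any static or contextualized embedding lookup, and the exact measures used in the paper may differ.

        import numpy as np

        def cosine(a: np.ndarray, b: np.ndarray) -> float:
            return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

        def direct_bias(words, embed) -> float:
            """Average absolute projection of each word onto a he-she direction."""
            g = embed("he") - embed("she")  # a simple one-pair gender direction
            return float(np.mean([abs(cosine(embed(w), g)) for w in words]))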

    Linguistic knowledge-based vocabularies for Neural Machine Translation

    This article has been published in a revised form in Natural Language Engineering, https://doi.org/10.1017/S1351324920000364. This version is free to view and download for private research and study only; not for re-distribution, re-sale or use in derivative works. © Cambridge University Press. Neural networks applied to machine translation need a finite vocabulary to express textual information as a sequence of discrete tokens. The currently dominant subword vocabularies exploit statistically discovered common parts of words to achieve the flexibility of character-based vocabularies without delegating the whole learning of word formation to the neural network. However, they trade this for the inability to apply word-level token associations, which limits their use in semantically rich areas, prevents some transfer learning approaches (e.g., cross-lingual pretrained embeddings), and reduces their interpretability. In this work, we propose new hybrid linguistically grounded vocabulary definition strategies that keep both the advantages of subword vocabularies and the word-level associations, enabling neural networks to profit from the derived benefits. We test the proposed approaches in both morphologically rich and morphologically poor languages, showing that, for the former, the quality of the translation of out-of-domain texts improves with respect to a strong subword baseline. This work is partially supported by Lucy Software / United Language Group (ULG) and the Catalan Agency for Management of University and Research Grants (AGAUR) through an Industrial PhD Grant. This work is also supported in part by the Spanish Ministerio de Economía y Competitividad, the European Regional Development Fund and the Agencia Estatal de Investigación, through the postdoctoral senior grant Ramón y Cajal, contract TEC2015-69266-P (MINECO/FEDER, EU) and contract PCIN-2017-079 (AEI/MINECO). Peer reviewed. Postprint (author's final draft).
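
    As a toy illustration of a hybrid, linguistically grounded vocabulary of the kind proposed here, the sketch below keeps frequent words whole and replaces rare words with their lemma plus a morphology tag, so word-level associations remain available. The splitting rule and the tag format are assumptions for illustration, not the paper's specification.

        def tokenize(word: str, lemma: str, morph_tag: str, frequent_words: set) -> list:
            if word in frequent_words:
                return [word]                 # word-level token, associations intact
            return [lemma, f"<{morph_tag}>"]  # e.g., "walked" -> ["walk", "<PAST>"]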

    Injection of linguistic knowledge into neural text generation models

    Embargo applied from the thesis defense date until 1 February 2021. Language is an organic construct. It emanates from the need for communication and changes through time, influenced by multiple factors. The resulting language structures are a mix of regular syntactic and morphological constructions together with divergent irregular elements. Linguistics aims at formalizing these structures, providing a rationalization of the underlying phenomena. However, linguistic information alone is not enough to fully characterize the structures in language, as they are intrinsically tied to meaning, which constrains and modulates the applicability of the linguistic phenomena, and also to context and domain. Classical machine translation approaches, like rule-based systems, relied completely on linguistic formalisms. Hundreds of morphological and grammatical rules were wired together to analyze input text and translate it into the target language, trying to take into account the semantic load it carries. While this kind of processing can satisfactorily address most of the low-level language structures, many of the meaning-dependent structures failed to be analyzed correctly. On the other hand, the dominant neural language processing systems are trained from raw textual data, handling it as a sequence of discrete tokens. These discrete tokens are normally defined by looking for reusable word pieces identified statistically from data. In the whole training process, there is no explicit notion of linguistic knowledge: no morphemes, no morphological information, no relationships among words, no hierarchical groupings. This thesis aims at bridging the gap between neural systems and linguistics-based systems, devising systems that have the flexibility and good results of the former while being grounded in linguistic formalisms, with the purposes of improving quality where data alone cannot and of forcing human-understandable working dynamics into the otherwise black-box neural systems. For this, we propose techniques to fuse statistical subwords with word-level linguistic information, to remove subwords altogether and rely solely on lemmas and morphological traits of the words, and to drive the text generation process by the ordering defined by syntactic dependencies. The main results of the proposed methods are the improvements in translation quality that can be obtained by injecting morphological information into NMT systems when testing on out-of-domain data for morphologically rich languages, and the control over the generated text that can be gained by linking the generation order to the syntactic structure. Postprint (published version).

    Syntax-driven iterative expansion language models for controllable text generation

    The dominant language modeling paradigm handles text as a sequence of discrete tokens. While that approach can capture the latent structure of the text, it is inherently constrained to sequential dynamics for text generation. We propose a new paradigm for introducing a syntactic inductive bias into neural text generation, where the dependency parse tree is used to drive the Transformer model to generate sentences iteratively. Our experiments show that this paradigm is effective at text generation, with quality between that of LSTMs and Transformers and comparable diversity, while requiring fewer than half their decoding steps, and that its generation process allows direct control over the syntactic constructions of the generated text, enabling the induction of stylistic variations. This work is partially supported by Lucy Software / United Language Group (ULG) and the Catalan Agency for Management of University and Research Grants (AGAUR) through an Industrial PhD Grant. This work is also supported in part by the Spanish Ministerio de Economía y Competitividad, the European Regional Development Fund through the postdoctoral senior grant Ramón y Cajal, and by the Agencia Estatal de Investigación through the projects EUR2019-103819, PCIN2017-079 and PID2019-107579RB-I00 / AEI / 10.13039/501100011033. Peer reviewed. Postprint (published version).
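
    A schematic view of the dependency-driven iterative decoding: generation starts at the root of the dependency tree, and each step expands every frontier node with its dependents, so several tokens are emitted per decoding step. Here expand_fn is a hypothetical stand-in for the trained model, assumed to return the (possibly empty) list of dependent tokens for a given head.

        def iterative_expansion(expand_fn, max_levels=8):
            # Level 0: predict the root of the dependency tree.
            root = {"token": expand_fn(None)[0], "children": []}
            frontier = [root]
            for _ in range(max_levels):
                next_frontier = []
                for node in frontier:
                    # Predict all dependents of this head in a single step.
                    for child_token in expand_fn(node["token"]):
                        child = {"token": child_token, "children": []}
                        node["children"].append(child)
                        next_frontier.append(child)
                if not next_frontier:  # no node produced new dependents
                    break
                frontier = next_frontier
            return root  # linearize the tree to obtain the generated sentence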

    Extensive study on the underlying gender bias in contextualized word embeddings

    Gender bias is affecting many natural language processing applications. While we are still far from proposing debiasing methods that will solve the problem, we are making progress in analyzing the impact of this bias on current algorithms. This paper provides an extensive study of the underlying gender bias in popular contextualized word embeddings. Our study analyzes evaluation measures applied to several English data domains and to the layers of the contextualized word embeddings, and it is also adapted and extended to Spanish. It points out the advantages and limitations of the various evaluation measures we use and aims to standardize the evaluation of gender bias in contextualized word embeddings. This work is supported in part by the Catalan Agency for Management of University and Research Grants (AGAUR) through the FI PhD Scholarship and the Industrial PhD Grant. This work is also supported in part by the Spanish Ministerio de Economía y Competitividad, the European Regional Development Fund, and the Agencia Estatal de Investigación through the postdoctoral senior grant Ramón y Cajal and the projects EUR2019-103819, PCIN-2017-079 and PID2019-107579RB-I00. Peer reviewed. Postprint (published version).
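
    The layer-wise part of such a study needs one vector per layer for a word in context. A minimal sketch using the HuggingFace transformers package (the model name is illustrative, and position is a subword-level index into the tokenized sentence):

        import torch
        from transformers import AutoModel, AutoTokenizer

        tok = AutoTokenizer.from_pretrained("bert-base-cased")
        model = AutoModel.from_pretrained("bert-base-cased", output_hidden_states=True)

        def layer_vectors(sentence: str, position: int):
            inputs = tok(sentence, return_tensors="pt")
            with torch.no_grad():
                hidden = model(**inputs).hidden_states  # embeddings + one entry per layer
            # One vector per layer for the chosen token; feed each into a bias
            # measure (e.g., the direct_bias sketch above) to score layers.
            return [h[0, position] for h in hidden]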